An Evaluation To Detect And Correct Erroneous Characters Wrongly Substituted, Deleted And Inserted In Japanese And English Sentences Using Markov Models
Authors
Abstract
Key words: Markov model, error detection, error correction, bunsetsu, substitution, deletion, insertion
1 Introduction
In order to improve the man-machine interface with computers, the development of input devices such as optical character readers (OCR) or speech recognition devices is expected. However, it is not easy to input Japanese sentences with these devices, because they are written with many kinds of characters, especially thousands of "kanji" characters. The sentences input through an OCR or a speech recognition device usually contain erroneous character strings. The techniques of natural language processing are expected to find and correct these errors. However, since current technologies of natural language analysis have been developed for correct sentences, they cannot be applied directly to these problems. Up to now, statistical approaches have been taken to this problem. Markov models are considered to be one of the machine learning models, similar to neural networks and fuzzy models. They have been applied to character chains of natural languages (e.g., English) [1],[2], and to phoneme recognition [3],[4] chains in continuous speech [5]. The 2nd-order Markov model of bunsetsu is known to be useful to correct errors in "kanji-kana" "bunsetsu" [6], to choose a correct syllable chain from Japanese syllable "bunsetsu" candidates [7], and to reduce the ambiguities in translation processing of non-segmented "kana" sentences into "kanji-kana" sentences [8]. The erroneous characters can be classified into three types. The first is wrongly recognized characters substituted for correct characters. The second and the third are wrongly inserted and deleted (skipped) characters, respectively. The Markov chain models mentioned above were restricted to finding and correcting the first type of error [5],[6]. No method has been proposed for correcting errors of the second and the third types. The reason might be considered to be the difficulties of finding the error location and distinguishing between deletion and insertion
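To make the three error types concrete, the following is a minimal Python sketch of how a 2nd-order (trigram) character Markov model could score single-character repairs. It is an illustration only and not the method of the paper: the function names (`train`, `avg_log_prob`, `candidates`, `correct`), the `^`/`$` boundary markers, the add-one smoothing, the per-transition averaging used to compare strings of different lengths, and the toy training string are all assumptions introduced here.

```python
import math
from collections import Counter

def train(corpus):
    """Count character trigrams and their two-character contexts."""
    tri, ctx, alphabet = Counter(), Counter(), set()
    for sentence in corpus:
        padded = "^^" + sentence + "$"   # "^" and "$" are assumed boundary markers
        alphabet.update(sentence)
        for i in range(2, len(padded)):
            tri[padded[i - 2:i + 1]] += 1
            ctx[padded[i - 2:i]] += 1
    return tri, ctx, sorted(alphabet)

def avg_log_prob(text, model):
    """Mean log-probability per transition, with add-one smoothing (an assumption)."""
    tri, ctx, alphabet = model
    v = len(alphabet) + 1                # possible next symbols, including "$"
    padded = "^^" + text + "$"
    total = sum(
        math.log((tri[padded[i - 2:i + 1]] + 1) / (ctx[padded[i - 2:i]] + v))
        for i in range(2, len(padded))
    )
    return total / (len(padded) - 2)

def candidates(text, alphabet):
    """Yield (suspected error type, repaired text) for every single-character edit."""
    for i in range(len(text)):
        # A wrongly inserted character is repaired by dropping it.
        yield "insertion", text[:i] + text[i + 1:]
        for c in alphabet:
            if c != text[i]:
                # A wrongly substituted character is repaired by replacing it.
                yield "substitution", text[:i] + c + text[i + 1:]
    for i in range(len(text) + 1):
        for c in alphabet:
            # A wrongly deleted (skipped) character is repaired by restoring it.
            yield "deletion", text[:i] + c + text[i:]

def correct(text, model):
    """Return the best-scoring (error type, text); ("none", text) if no edit helps."""
    best, best_score = ("none", text), avg_log_prob(text, model)
    for kind, cand in candidates(text, model[2]):
        score = avg_log_prob(cand, model)
        if score > best_score:
            best, best_score = (kind, cand), score
    return best

model = train(["abcabcabc"])             # toy training data, purely illustrative
print(correct("abbabc", model))
```

The sketch tries every position exhaustively, so it sidesteps the difficulty named above (locating the error and telling deletion from insertion) only by brute force on a tiny alphabet; with thousands of kanji characters the candidate set would need to be pruned.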
Similar resources
An Evaluation of a Method to Detect and Correct Erroneous Characters in Japanese input through an OCR using Markov Models
The "Selective Error Correction Method" to judge these three types of errors, and correct them, using an m-th order Markov chain model for Japanese 'kanji-kana' characters, has been proposed and shown to be useful to detect and correct errors generated randomly (Araki et al., 1994). In this paper, this method is applied to detect and correct erroneous characters in Japanese text input through an...
Mining Sequential Patterns and Tree Patterns to Detect Erroneous Sentences
An important application area of detecting erroneous sentences is to provide feedback for writers of English as a Second Language. This problem is difficult since both erroneous and correct sentences are diversified. In this paper, we propose a novel approach to identifying erroneous sentences. We first mine labeled tree patterns and sequential patterns to characterize both erroneous and correc...
Evaluation of Genetic Diversity in Japanese and English White Quail Populations Using Microsatellite Markers
The Japanese and English White quails are widespread strains and belong to the Galliformes order, Phasianidae family, Coturnix genus and Japonica species. These birds are likely to be well adapted to hard conditions and resistant to diseases, as they have attained economic importance as an agricultural species. In the current study, the genetic variation of Japanese and English White quail ...
An Investigation into the Effective Factors in Comprehending English Garden-Path Sentences by EFL Learners
The present study aimed at highlighting the possible effects of age, proficiency level, and the structural composition of Garden-Path (GP) sentences on EFL learners' comprehension. 80 Iranian EFL learners were recruited from the initial pool of 114 participants based on the results of an English proficiency test; 40 advanced, and 40 intermediate learners were selected. Moreover, two age...
Lexical choice in Abstract Dependency Trees
In this work lexical choice in generation for Machine Translation is explored using lexical semantics. We address this problem by replacing lemmas with synonyms in the abstract representations that are used as input for generation, given a WordNet synset. In order to find the correct lemma for each node we propose to map dependency trees to Hidden Markov Trees that describe the probability of a...